The available data comprises records containing several types of measurements from motion sensors:
The motion sensors were placed at four different places on the human subjects’ bodies and on the dumbbell itself:
There is a column for each one of these measurements in the provided data-sets (*.csv files). We are going to explore these variables in more detail in the sections that follow.
The objective of this analysis is to classify each record in the testing data-set as belonging to one of the following classes (column classe, “exercise class”):
| Class Name | Description |
|---|---|
| A | Sitting (Still) |
| B | Sitting Down |
| C | Standing (Still) |
| D | Standing Up |
| E | Walking |
The human subjects are six different users identified by the variable user_name.
I believe this summarizes everything we need to know about the nature of the information in the data-sets, but you can find more detailed information here.
You might need to install the following packages if you don’t already have them:
install.packages("plyr")
install.packages("dplyr")
install.packages("stringr")
install.packages("Amelia")
install.packages("xtable")
install.packages("ggplot2")
install.packages("gridExtra")
install.packages("caret")
install.packages("randomForest")
install.packages("e1071")
install.packages("rpart")
install.packages("tictoc")
install.packages("reshape2")
Just run the lines for the packages you are missing in this chunk before you run the remaining ones in this notebook.
Once the libraries are installed, they need to be loaded as follows:
suppressMessages(library(plyr, warn.conflicts = FALSE)) # Manipulating dataframes
suppressMessages(library(dplyr, warn.conflicts = FALSE))
suppressMessages(library(stringr)) # String operations
suppressMessages(library(Amelia)) # Missing data plots
suppressMessages(library(xtable)) # Pretty printing dataframes
suppressMessages(library(ggplot2)) # Plotting
suppressMessages(library(gridExtra))
suppressMessages(library(reshape2))
suppressMessages(library(caret)) # Machine learning
suppressMessages(library(randomForest))
suppressMessages(library(e1071))
suppressMessages(library(rpart))
suppressMessages(library(tictoc)) # Benchmarking
Let’s first load the data-sets:
training <- read.csv('../input/pml-training.csv', na.strings=c("NA", "NULL", ""), stringsAsFactors = F)
testing <- read.csv('../input/pml-testing.csv', na.strings=c("NA", "NULL", ""), stringsAsFactors = F)
I also want to create a data-set which has all records. There’s an odd thing about these data-sets though: the record ID (X) restarts from one in the testing data-set, suggesting that the two files were indexed independently rather than split from a single numbered data-set. I’m going to add a new column, record_id, so we can distinguish training and testing records more easily:
training_rows <- nrow(training)
testing_rows <- nrow(testing)
all_rows <- training_rows + testing_rows
training$record_id <- 1:training_rows
testing$record_id <- (training_rows+1):all_rows
all <- rbind.fill(training, testing)
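rbind.fill() stacks the two data frames and pads columns present in only one of them with NA (here, classe is absent from testing and problem_id from training). A minimal base-R sketch of the same idea, using toy frames rather than the real data:

```r
# Base-R sketch of what plyr::rbind.fill() does: take the union of the
# column names and pad the missing columns with NA before stacking.
union_rbind <- function(a, b) {
  cols <- union(names(a), names(b))
  a[setdiff(cols, names(a))] <- NA
  b[setdiff(cols, names(b))] <- NA
  rbind(a[cols], b[cols])
}

train_toy <- data.frame(x = 1:2, classe = c("A", "B"))
test_toy  <- data.frame(x = 3, problem_id = 1)
combined  <- union_rbind(train_toy, test_toy)
# combined has columns x, classe, problem_id; the padded cells are NA.
```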
I’m going to introduce a couple of functions before going forward. Data frames show nicely in RStudio, but they look terrible in the HTML output, so I have created a user-defined function which renders data frames as HTML tables:
render_table_in_viewer_pane <- function(data, digits) {
html <- print(xtable(data, digits = digits), type = "html", print.results=FALSE)
temp <- tempfile(fileext = ".html")
cat(html, file = temp)
rstudioapi::viewer(temp)
}
render_table <- function(data, digits = 2) {
print(xtable(data, digits = digits), type = "html")
}
I have also created a function that builds a summary of record counts:
data_count <- function(training, testing) {
result <- data.frame(measurement = "training records", count = nrow(training))
result <- rbind(result, data.frame(measurement = "testing records", count = nrow(testing)))
result <- rbind(result, data.frame(measurement = "total records", count = nrow(training) + nrow(testing)))
result <- rbind(result, data.frame(measurement = "variables", count = length(names(training))))
return(result)
}
Now we can pretty print the summary using this function:
render_table(data_count(training, testing))
| | measurement | count |
|---|---|---|
| 1 | training records | 19622 |
| 2 | testing records | 20 |
| 3 | total records | 19642 |
| 4 | variables | 161 |
render_table() will also render the HTML table in RStudio’s “Viewer” pane (which should be visible on your right). You should be able to see it after executing this chunk.
It’s always a good idea to know about missing values from the start, so I have created a function for this purpose. Another option would be using functions from the “Amelia” package, but due to the large number of missing values (NAs) in this data, the missingness maps just don’t look good.
missing_summary <- function(data) {
count <- data %>% summarise_each(funs(sum(is.na(.))))
output <- data.frame(variable = names(count), missing_count = t(count))
rownames(output) <- 1:nrow(output)
output <- output[output$missing_count > 0, ]
return(output)
}
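As an aside, summarise_each() has been deprecated in newer dplyr releases; the same per-column NA count can be computed with base R’s colSums(), with no dplyr dependency. A sketch on a toy frame:

```r
# Equivalent missing-value summary in base R: colSums(is.na(.))
# counts the NAs in every column in a single pass.
missing_summary_base <- function(data) {
  counts <- colSums(is.na(data))
  output <- data.frame(variable = names(counts), missing_count = unname(counts))
  output[output$missing_count > 0, ]
}

toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = 1:3)
missing_summary_base(toy)  # rows for a (1 missing) and b (2 missing)
```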
missing <- missing_summary(all[, !names(all) %in% c("classe") ])
render_table(missing)
| | variable | missing_count |
|---|---|---|
| 12 | kurtosis_roll_belt | 19236 |
| 13 | kurtosis_picth_belt | 19236 |
| 14 | kurtosis_yaw_belt | 19236 |
| 15 | skewness_roll_belt | 19236 |
| 16 | skewness_roll_belt.1 | 19236 |
| 17 | skewness_yaw_belt | 19236 |
| 18 | max_roll_belt | 19236 |
| 19 | max_picth_belt | 19236 |
| 20 | max_yaw_belt | 19236 |
| 21 | min_roll_belt | 19236 |
| 22 | min_pitch_belt | 19236 |
| 23 | min_yaw_belt | 19236 |
| 24 | amplitude_roll_belt | 19236 |
| 25 | amplitude_pitch_belt | 19236 |
| 26 | amplitude_yaw_belt | 19236 |
| 27 | var_total_accel_belt | 19236 |
| 28 | avg_roll_belt | 19236 |
| 29 | stddev_roll_belt | 19236 |
| 30 | var_roll_belt | 19236 |
| 31 | avg_pitch_belt | 19236 |
| 32 | stddev_pitch_belt | 19236 |
| 33 | var_pitch_belt | 19236 |
| 34 | avg_yaw_belt | 19236 |
| 35 | stddev_yaw_belt | 19236 |
| 36 | var_yaw_belt | 19236 |
| 50 | var_accel_arm | 19236 |
| 51 | avg_roll_arm | 19236 |
| 52 | stddev_roll_arm | 19236 |
| 53 | var_roll_arm | 19236 |
| 54 | avg_pitch_arm | 19236 |
| 55 | stddev_pitch_arm | 19236 |
| 56 | var_pitch_arm | 19236 |
| 57 | avg_yaw_arm | 19236 |
| 58 | stddev_yaw_arm | 19236 |
| 59 | var_yaw_arm | 19236 |
| 69 | kurtosis_roll_arm | 19236 |
| 70 | kurtosis_picth_arm | 19236 |
| 71 | kurtosis_yaw_arm | 19236 |
| 72 | skewness_roll_arm | 19236 |
| 73 | skewness_pitch_arm | 19236 |
| 74 | skewness_yaw_arm | 19236 |
| 75 | max_roll_arm | 19236 |
| 76 | max_picth_arm | 19236 |
| 77 | max_yaw_arm | 19236 |
| 78 | min_roll_arm | 19236 |
| 79 | min_pitch_arm | 19236 |
| 80 | min_yaw_arm | 19236 |
| 81 | amplitude_roll_arm | 19236 |
| 82 | amplitude_pitch_arm | 19236 |
| 83 | amplitude_yaw_arm | 19236 |
| 87 | kurtosis_roll_dumbbell | 19236 |
| 88 | kurtosis_picth_dumbbell | 19236 |
| 89 | kurtosis_yaw_dumbbell | 19236 |
| 90 | skewness_roll_dumbbell | 19236 |
| 91 | skewness_pitch_dumbbell | 19236 |
| 92 | skewness_yaw_dumbbell | 19236 |
| 93 | max_roll_dumbbell | 19236 |
| 94 | max_picth_dumbbell | 19236 |
| 95 | max_yaw_dumbbell | 19236 |
| 96 | min_roll_dumbbell | 19236 |
| 97 | min_pitch_dumbbell | 19236 |
| 98 | min_yaw_dumbbell | 19236 |
| 99 | amplitude_roll_dumbbell | 19236 |
| 100 | amplitude_pitch_dumbbell | 19236 |
| 101 | amplitude_yaw_dumbbell | 19236 |
| 103 | var_accel_dumbbell | 19236 |
| 104 | avg_roll_dumbbell | 19236 |
| 105 | stddev_roll_dumbbell | 19236 |
| 106 | var_roll_dumbbell | 19236 |
| 107 | avg_pitch_dumbbell | 19236 |
| 108 | stddev_pitch_dumbbell | 19236 |
| 109 | var_pitch_dumbbell | 19236 |
| 110 | avg_yaw_dumbbell | 19236 |
| 111 | stddev_yaw_dumbbell | 19236 |
| 112 | var_yaw_dumbbell | 19236 |
| 125 | kurtosis_roll_forearm | 19236 |
| 126 | kurtosis_picth_forearm | 19236 |
| 127 | kurtosis_yaw_forearm | 19236 |
| 128 | skewness_roll_forearm | 19236 |
| 129 | skewness_pitch_forearm | 19236 |
| 130 | skewness_yaw_forearm | 19236 |
| 131 | max_roll_forearm | 19236 |
| 132 | max_picth_forearm | 19236 |
| 133 | max_yaw_forearm | 19236 |
| 134 | min_roll_forearm | 19236 |
| 135 | min_pitch_forearm | 19236 |
| 136 | min_yaw_forearm | 19236 |
| 137 | amplitude_roll_forearm | 19236 |
| 138 | amplitude_pitch_forearm | 19236 |
| 139 | amplitude_yaw_forearm | 19236 |
| 141 | var_accel_forearm | 19236 |
| 142 | avg_roll_forearm | 19236 |
| 143 | stddev_roll_forearm | 19236 |
| 144 | var_roll_forearm | 19236 |
| 145 | avg_pitch_forearm | 19236 |
| 146 | stddev_pitch_forearm | 19236 |
| 147 | var_pitch_forearm | 19236 |
| 148 | avg_yaw_forearm | 19236 |
| 149 | stddev_yaw_forearm | 19236 |
| 150 | var_yaw_forearm | 19236 |
| 161 | problem_id | 19622 |
We want to do imputation whenever possible, but given the huge number of missing values in a number of columns (19216 out of 19622 values are NAs in the training data-set), the only sensible thing we can do is remove these columns:
remove_columns <- function(data, columns) {
data <- data[, ! names(data) %in% columns]
return(data)
}
extract_data_by_record_id <- function(data_from, ids_from) {
return(data_from[data_from$record_id %in% ids_from$record_id, ])
}
all <- remove_columns(all, missing$variable)
training <- extract_data_by_record_id(all, training)
testing <- extract_data_by_record_id(all, testing)
Now we can take a sample from the data and see what it looks like:
sample_data_frame <- function(data, size) {
sample_index <- sample(1:nrow(data), size)
return(data[sample_index, ])
}
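One note on reproducibility: sample() is driven by the random number generator, so the table below will differ between runs. Calling set.seed() first makes the draw repeatable. A small base-R illustration (the sampling expression is inlined so the snippet stands alone):

```r
# Fixing the RNG seed makes a random sample reproducible across runs.
toy <- data.frame(id = 1:100)

set.seed(42)
draw_a <- toy[sample(1:nrow(toy), 10), , drop = FALSE]

set.seed(42)
draw_b <- toy[sample(1:nrow(toy), 10), , drop = FALSE]

identical(draw_a, draw_b)  # TRUE
```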
sample <- sample_data_frame(training, 10)
render_table(sample)
| | X | user_name | raw_timestamp_part_1 | raw_timestamp_part_2 | cvtd_timestamp | new_window | num_window | roll_belt | pitch_belt | yaw_belt | total_accel_belt | gyros_belt_x | gyros_belt_y | gyros_belt_z | accel_belt_x | accel_belt_y | accel_belt_z | magnet_belt_x | magnet_belt_y | magnet_belt_z | roll_arm | pitch_arm | yaw_arm | total_accel_arm | gyros_arm_x | gyros_arm_y | gyros_arm_z | accel_arm_x | accel_arm_y | accel_arm_z | magnet_arm_x | magnet_arm_y | magnet_arm_z | roll_dumbbell | pitch_dumbbell | yaw_dumbbell | total_accel_dumbbell | gyros_dumbbell_x | gyros_dumbbell_y | gyros_dumbbell_z | accel_dumbbell_x | accel_dumbbell_y | accel_dumbbell_z | magnet_dumbbell_x | magnet_dumbbell_y | magnet_dumbbell_z | roll_forearm | pitch_forearm | yaw_forearm | total_accel_forearm | gyros_forearm_x | gyros_forearm_y | gyros_forearm_z | accel_forearm_x | accel_forearm_y | accel_forearm_z | magnet_forearm_x | magnet_forearm_y | magnet_forearm_z | classe | record_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7004 | 7004 | jeremy | 1322673076 | 86744 | 30/11/2011 17:11 | no | 440 | 0.38 | 3.75 | -88.50 | 5 | -0.13 | -0.03 | -0.03 | -13 | 5 | 45 | 39 | 634 | -302 | 0.00 | 0.00 | 0.00 | 19 | -3.48 | 1.59 | -0.56 | -57 | 151 | -93 | 616 | 167 | 316 | 56.37 | -51.89 | -75.39 | 29 | 0.37 | 0.13 | -0.34 | -142 | 153 | -196 | -505 | 356 | 46.00 | -163.00 | 44.20 | -91.50 | 41 | -0.18 | 1.03 | -0.23 | -98 | -283 | -269 | -129 | -612.00 | -31.00 | B | 7004 |
| 10543 | 10543 | carlitos | 1323084310 | 520389 | 05/12/2011 11:25 | no | 381 | 2.06 | 5.97 | -92.90 | 2 | -0.03 | -0.02 | 0.07 | -13 | 2 | 19 | 6 | 599 | -320 | 69.80 | -29.00 | 22.60 | 18 | -0.43 | 0.42 | -0.41 | 80 | 48 | -148 | 725 | -64 | 232 | -123.04 | -22.37 | 25.22 | 4 | -0.03 | -0.11 | 0.07 | -8 | -34 | 9 | -579 | 215 | 25.00 | 145.00 | 42.70 | 135.00 | 34 | -0.55 | 2.73 | 0.38 | -110 | 252 | -190 | -656 | 472.00 | 572.00 | C | 10543 |
| 13359 | 13359 | charles | 1322837917 | 856278 | 02/12/2011 14:58 | no | 131 | 119.00 | 16.20 | -1.93 | 18 | 0.08 | 0.11 | -0.15 | -20 | 63 | -162 | 23 | 595 | -336 | -44.40 | -22.40 | -14.80 | 16 | 0.72 | -0.58 | 0.82 | 82 | -105 | -74 | 695 | 17 | 198 | 109.23 | 41.53 | -22.38 | 5 | -0.10 | -0.18 | -0.16 | 20 | 44 | -11 | -363 | 501 | -49.00 | -74.10 | 17.60 | -164.00 | 52 | -0.45 | 0.75 | -0.20 | -433 | -266 | -20 | -474 | -305.00 | 134.00 | D | 13359 |
| 9836 | 9836 | eurico | 1322489668 | 294676 | 28/11/2011 14:14 | no | 264 | 1.37 | 1.90 | -88.20 | 2 | 0.05 | 0.00 | -0.03 | -6 | 3 | 23 | 27 | 567 | -440 | 89.00 | 4.18 | 90.60 | 35 | -3.60 | 1.38 | -0.33 | -222 | 254 | 27 | -45 | 295 | 622 | -92.18 | -31.14 | -54.79 | 7 | 0.39 | -0.40 | -0.07 | -20 | -52 | -34 | -497 | 119 | 281.00 | 177.00 | 20.00 | -81.00 | 24 | 0.45 | -3.84 | -1.28 | 4 | -118 | 207 | 139 | -771.00 | 130.00 | C | 9836 |
| 7160 | 7160 | jeremy | 1322673083 | 874781 | 30/11/2011 17:11 | no | 446 | 1.33 | 5.46 | -87.90 | 4 | -0.06 | 0.02 | 0.02 | -23 | 1 | 34 | 54 | 632 | -317 | 0.00 | 0.00 | 0.00 | 35 | -1.20 | 0.87 | -0.26 | -241 | 240 | -54 | -343 | 449 | 397 | 55.47 | -49.01 | -78.64 | 31 | 0.66 | -0.26 | -1.15 | -143 | 160 | -215 | -539 | 288 | 91.00 | 148.00 | 6.88 | 134.00 | 46 | 1.67 | -5.41 | -0.75 | 160 | 411 | -96 | -587 | 685.00 | 866.00 | B | 7160 |
| 763 | 763 | adelmo | 1322832775 | 605371 | 02/12/2011 13:32 | no | 177 | 130.00 | -39.80 | 166.00 | 19 | 0.08 | 0.11 | -0.21 | 46 | 44 | -174 | 182 | 587 | -360 | -74.80 | 21.50 | 105.00 | 24 | 0.47 | -0.59 | 0.85 | -220 | -67 | 18 | -137 | 192 | 663 | 47.95 | -26.28 | -101.29 | 19 | 0.16 | 0.08 | -0.25 | -49 | 87 | -160 | -571 | 260 | 2.00 | 0.00 | 0.00 | 0.00 | 13 | 0.16 | 1.03 | 0.21 | 76 | 94 | 37 | 659 | -128.00 | 56.00 | A | 763 |
| 10485 | 10485 | carlitos | 1323084307 | 136427 | 05/12/2011 11:25 | no | 379 | 1.57 | 6.16 | -92.90 | 3 | -0.06 | 0.02 | -0.05 | -16 | 3 | 29 | 3 | 606 | -305 | 66.30 | -32.80 | 30.00 | 17 | -0.48 | 0.48 | -0.52 | 79 | 46 | -138 | 714 | -79 | 249 | -120.75 | -5.85 | 35.81 | 3 | 0.05 | -0.16 | -0.07 | -2 | -32 | 12 | -590 | 195 | 12.00 | 143.00 | 37.10 | 136.00 | 32 | 0.14 | 3.08 | 0.56 | -67 | 235 | -191 | -651 | 468.00 | 528.00 | C | 10485 |
| 79 | 79 | carlitos | 1323084235 | 72306 | 05/12/2011 11:23 | no | 15 | 1.14 | 7.31 | -94.00 | 3 | 0.02 | 0.00 | -0.03 | -18 | 2 | 22 | -7 | 602 | -304 | -131.00 | 17.90 | -162.00 | 34 | 0.00 | -0.02 | 0.02 | -288 | 109 | -125 | -368 | 335 | 516 | 13.33 | -70.85 | -84.45 | 37 | 0.02 | -0.02 | -0.02 | -235 | 48 | -270 | -555 | 294 | -66.00 | 23.40 | -63.60 | -147.00 | 36 | 0.05 | -0.02 | -0.02 | 189 | 207 | -217 | -9 | 656.00 | 476.00 | A | 79 |
| 5853 | 5853 | pedro | 1323095009 | 941345 | 05/12/2011 14:23 | no | 80 | 122.00 | 25.90 | -4.22 | 19 | -0.43 | -0.02 | -0.41 | -39 | 71 | -167 | 2 | 587 | -352 | 133.00 | 21.60 | -38.60 | 48 | 1.20 | -1.48 | 0.31 | 365 | -130 | -271 | 466 | -115 | -483 | -99.07 | -22.23 | 52.57 | 13 | 0.47 | 0.83 | -0.92 | -29 | -110 | 66 | 276 | -693 | -149.00 | 118.00 | 43.10 | 122.00 | 38 | -0.79 | 4.26 | 1.79 | 2 | 346 | -145 | -690 | 560.00 | 929.00 | B | 5853 |
| 5721 | 5721 | pedro | 1323095003 | 376329 | 05/12/2011 14:23 | no | 75 | 121.00 | 26.60 | -4.47 | 18 | -0.50 | -0.03 | -0.41 | -42 | 72 | -161 | -1 | 588 | -360 | 180.00 | -43.10 | 26.40 | 33 | -3.40 | 1.20 | 0.30 | 316 | 6 | 57 | 732 | 84 | 42 | -93.96 | -15.13 | 60.71 | 9 | 0.42 | 0.67 | -0.66 | -14 | -75 | 53 | 323 | -673 | -153.00 | -143.00 | 30.00 | -88.00 | 35 | -0.35 | 0.85 | 0.00 | -60 | -229 | -242 | -124 | -623.00 | -13.00 | B | 5721 |
Looking much better now. Let’s check for missing values once again:
render_table(missing_summary(training))
| variable | missing_count |
|---|---|
No missing values to be found. Looking good.
We should do the same checking for the testing data-set as well:
render_table(missing_summary(testing))
| | variable | missing_count |
|---|---|---|
| 60 | classe | 20 |
Only classe is missing, which is expected, since the testing data-set isn’t labeled.
read.csv() assigns types to variables automatically. You can always check your data types by invoking str():
str(training)
## 'data.frame': 19622 obs. of 61 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : chr "carlitos" "carlitos" "carlitos" "carlitos" ...
## $ raw_timestamp_part_1: int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2: int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : chr "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
## $ new_window : chr "no" "no" "no" "no" ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 0 0 -0.02 0 0 0 0 0 0 ...
## $ accel_dumbbell_x : int -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
## $ accel_dumbbell_y : int 47 47 46 48 48 48 47 46 47 48 ...
## $ accel_dumbbell_z : int -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
## $ magnet_dumbbell_x : int -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
## $ magnet_dumbbell_y : int 293 296 298 303 292 294 295 300 292 291 ...
## $ magnet_dumbbell_z : num -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
## $ roll_forearm : num 28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
## $ yaw_forearm : num -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
## $ gyros_forearm_y : num 0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
## $ gyros_forearm_z : num -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
## $ accel_forearm_x : int 192 192 196 189 189 193 195 193 193 190 ...
## $ accel_forearm_y : int 203 203 204 206 206 203 205 205 204 205 ...
## $ accel_forearm_z : int -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
## $ magnet_forearm_x : int -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
## $ magnet_forearm_y : num 654 661 658 658 655 660 659 660 653 656 ...
## $ magnet_forearm_z : num 476 473 469 469 473 478 470 474 476 473 ...
## $ classe : chr "A" "A" "A" "A" ...
## $ record_id : int 1 2 3 4 5 6 7 8 9 10 ...
Note that most variables are numeric (either integers or floats). The exceptions are user_name, cvtd_timestamp, new_window and classe (all strings).
str()’s output is a bit hard to read, so I’m going to group the variable names together according to where the motion sensor is placed:
matching_variables <- function(data, pattern) {
return(names(data)[grep(pattern, names(data))])
}
variable_summary <- function(data, patterns) {
result <- data.frame(pattern = character(0), variables = character(0))
for(pattern in patterns) {
variables <- paste(matching_variables(data, paste0(".*", pattern, ".*")), collapse = ", ")
result <- rbind(result, data.frame(pattern = pattern, variables = variables))
}
return(result)
}
render_table(variable_summary(training, c("belt", "[^fore]+arm", "forearm", "dumbbell")))
| | pattern | variables |
|---|---|---|
| 1 | belt | roll_belt, pitch_belt, yaw_belt, total_accel_belt, gyros_belt_x, gyros_belt_y, gyros_belt_z, accel_belt_x, accel_belt_y, accel_belt_z, magnet_belt_x, magnet_belt_y, magnet_belt_z |
| 2 | [^fore]+arm | roll_arm, pitch_arm, yaw_arm, total_accel_arm, gyros_arm_x, gyros_arm_y, gyros_arm_z, accel_arm_x, accel_arm_y, accel_arm_z, magnet_arm_x, magnet_arm_y, magnet_arm_z |
| 3 | forearm | roll_forearm, pitch_forearm, yaw_forearm, total_accel_forearm, gyros_forearm_x, gyros_forearm_y, gyros_forearm_z, accel_forearm_x, accel_forearm_y, accel_forearm_z, magnet_forearm_x, magnet_forearm_y, magnet_forearm_z |
| 4 | dumbbell | roll_dumbbell, pitch_dumbbell, yaw_dumbbell, total_accel_dumbbell, gyros_dumbbell_x, gyros_dumbbell_y, gyros_dumbbell_z, accel_dumbbell_x, accel_dumbbell_y, accel_dumbbell_z, magnet_dumbbell_x, magnet_dumbbell_y, magnet_dumbbell_z |
The function variable_summary() simply looks up and groups together the variables whose names match each given input pattern.
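The [^fore]+arm pattern deserves a quick note: it requires at least one character other than f, o, r or e immediately before "arm", which is what keeps the forearm variables out of the arm group (in names like roll_forearm the only "arm" is preceded by "e"). A minimal check:

```r
# "[^fore]+arm" matches "_arm" in the arm columns ("_" is outside the
# excluded class) but finds no match in the forearm columns, where every
# occurrence of "arm" is preceded by "e".
grepl(".*[^fore]+arm.*", c("gyros_arm_x", "roll_forearm"))  # TRUE FALSE
```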
Notice that these variable names are consistent with the experiment setup described in the overview section. To be completely honest, I did this variable exploration first and added the extra information to the overview section later. I also relied on information from the study that provided the data and from this document on motion sensors, but I believe you should rely more heavily on the data exploration itself.
A description for the remaining variables in the data-set follows:
| Measurement | Comment |
|---|---|
| X | Record ID. |
| user_name | Username of human subject being measured. |
| raw_timestamp_part_1 | First part of the timestamp, which is broken into two variables (due to integer size restrictions in the apparatus, it seems). |
| raw_timestamp_part_2 | Second part of the timestamp. |
| cvtd_timestamp | Timestamp formatted as a string: “%d/%m/%Y %H:%M”. |
| new_window | Is the window a new window: “yes” or “no”. |
| num_window | Window ID. |
| classe | “exercise class” with values A, B, C, D & E, as described in the overview section. |
All measurements are performed inside a one second time window:
Motion Sensor Time Window
Note that the windows overlap. In some of the scatter plots you will find multiple parallel horizontal lines, which are explained by the multiple overlapping measurements taken at each given timestamp.
I was a bit unsure about what the variable new_window represents, so I decided to investigate further and created a function that extracts the records for a given num_window, sorted by timestamp:
extract_window <- function(data, num_window) {
output <- data[data$num_window == num_window, ]
output <- output[order(output$raw_timestamp_part_1, output$raw_timestamp_part_2), ]
return(output[, names(output) %in% c("raw_timestamp_part_1", "raw_timestamp_part_2", "new_window", "num_window")])
}
render_table(extract_window(training, num_window = 12))
| | raw_timestamp_part_1 | raw_timestamp_part_2 | new_window | num_window |
|---|---|---|---|---|
| 4 | 1323084232 | 120339 | no | 12 |
| 5 | 1323084232 | 196328 | no | 12 |
| 6 | 1323084232 | 304277 | no | 12 |
| 7 | 1323084232 | 368296 | no | 12 |
| 8 | 1323084232 | 440390 | no | 12 |
| 9 | 1323084232 | 484323 | no | 12 |
| 10 | 1323084232 | 484434 | no | 12 |
| 11 | 1323084232 | 500302 | no | 12 |
| 12 | 1323084232 | 528316 | no | 12 |
| 13 | 1323084232 | 560359 | no | 12 |
| 14 | 1323084232 | 576390 | no | 12 |
| 15 | 1323084232 | 604281 | no | 12 |
| 16 | 1323084232 | 644302 | no | 12 |
| 17 | 1323084232 | 692324 | no | 12 |
| 18 | 1323084232 | 732306 | no | 12 |
| 19 | 1323084232 | 740353 | no | 12 |
| 20 | 1323084232 | 788335 | no | 12 |
| 21 | 1323084232 | 876301 | no | 12 |
| 22 | 1323084232 | 892313 | no | 12 |
| 23 | 1323084232 | 932285 | no | 12 |
| 24 | 1323084232 | 996313 | yes | 12 |
If you extract windows for different num_window values, you will find this very same pattern over and over: new_window seems to indicate that a measurement is the last one within its window, after which num_window is incremented for the next measurement.
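This claim can be checked programmatically: in every window, "yes" should only ever appear on the last record once the rows are ordered by the two timestamp parts. A sketch of such a check, run here on a small synthetic frame (it only needs the four columns involved, so it works without the real data too):

```r
# TRUE when, in every window, new_window == "yes" occurs only on the
# last record (ordered by raw_timestamp_part_1, raw_timestamp_part_2).
yes_only_on_last_record <- function(data) {
  all(vapply(split(data, data$num_window), function(w) {
    w <- w[order(w$raw_timestamp_part_1, w$raw_timestamp_part_2), ]
    all(w$new_window[-nrow(w)] == "no")
  }, logical(1)))
}

toy <- data.frame(
  raw_timestamp_part_1 = c(1, 1, 1, 2, 2),
  raw_timestamp_part_2 = c(10, 20, 30, 5, 9),
  new_window = c("no", "no", "yes", "no", "yes"),
  num_window = c(12, 12, 12, 13, 13))
yes_only_on_last_record(toy)  # TRUE
```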
Merging raw_timestamp_part_1 and raw_timestamp_part_2

I’m going to create some additional columns, which will make the data a bit easier to work with. For starters, there’s no need to keep the timestamp separated into two different columns (that’s a limitation of the motion sensor apparatus):
add_timestamp <- function(data) {
# raw_timestamp_part_2 looks like a microsecond counter (its values stay
# below 10^6), so scale it down before adding it to the seconds part;
# a plain sum would put events out of chronological order.
data$raw_timestamp <- data$raw_timestamp_part_1 + data$raw_timestamp_part_2 / 1e6
return(data)
}
all <- add_timestamp(all)
training <- extract_data_by_record_id(all, training)
testing <- extract_data_by_record_id(all, testing)
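A caveat about this merge (based on the assumption, from the observed value range, that raw_timestamp_part_2 is a microsecond counter): a plain sum of the two parts scrambles the chronological order, while scaling part 2 down by 10^6 preserves it. A quick numeric check with two consecutive events:

```r
# Two consecutive events: one late in second 1323084232, the next one
# early in second 1323084233.
part_1 <- c(1323084232, 1323084233)
part_2 <- c(996313, 120339)

naive  <- part_1 + part_2        # plain sum
merged <- part_1 + part_2 / 1e6  # part_2 treated as microseconds

naive[1] > naive[2]    # TRUE -- the earlier event sorts after the later one
merged[1] < merged[2]  # TRUE -- chronological order preserved
```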
NOTE: When imputing missing values or creating new features, I suggest always applying such transformations over the whole data (training + testing) when working with categorical variables. Say you have a factor named Pie with levels Apple, Pecan, Cherry and Pumpkin. It could be the case that your training data-set only has entries with levels Apple and Pecan, while your testing data-set only has Cherry and Pumpkin. On factor conversion, Pie from training will then have different (incompatible) levels from Pie from testing. This will be a pain when you’re ready to train your classifier. Of course, you can always override the factor levels for both data-sets afterwards, but it’s a somewhat awkward operation.
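A quick illustration of the problem, using the hypothetical Pie factor from the note:

```r
# Converting each split to a factor independently yields incompatible
# level sets.
train_pie <- factor(c("Apple", "Pecan"))
test_pie  <- factor(c("Cherry", "Pumpkin"))
identical(levels(train_pie), levels(test_pie))  # FALSE

# Re-converting with an explicit, shared level set (equivalent to
# converting over the combined data) makes the two splits compatible.
all_levels <- sort(union(levels(train_pie), levels(test_pie)))
train_pie  <- factor(train_pie, levels = all_levels)
test_pie   <- factor(test_pie,  levels = all_levels)
identical(levels(train_pie), levels(test_pie))  # TRUE
```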
Making classe More Descriptive

“A”, “B”, “C”, etc. are not very descriptive, so I’m going to add a column named classe_description:
add_classe_description <- function(data) {
descriptions <- data.frame(classe = "A", classe_description = "Sitting (Still)")
descriptions <- rbind(descriptions, data.frame(classe = "B", classe_description = "Sitting Down"))
descriptions <- rbind(descriptions, data.frame(classe = "C", classe_description = "Standing (Still)"))
descriptions <- rbind(descriptions, data.frame(classe = "D", classe_description = "Standing Up"))
descriptions <- rbind(descriptions, data.frame(classe = "E", classe_description = "Walking"))
descriptions$classe <- as.factor(descriptions$classe)
data$classe <- as.factor(data$classe)
return(left_join(data, descriptions, by = c("classe")))
}
all <- add_classe_description(all)
training <- extract_data_by_record_id(all, training)
testing <- extract_data_by_record_id(all, testing)
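As a side note, the same mapping can be written more compactly with a named lookup vector instead of a join; a sketch on a toy frame (indexing with as.character() guards against the factor-as-integer-index pitfall):

```r
# classe -> description as a named lookup vector.
classe_descriptions <- c(
  A = "Sitting (Still)", B = "Sitting Down", C = "Standing (Still)",
  D = "Standing Up",     E = "Walking")

toy <- data.frame(classe = c("B", "E", "A"), stringsAsFactors = FALSE)
# as.character() ensures we index by name, not by factor codes.
toy$classe_description <- unname(classe_descriptions[as.character(toy$classe)])
toy$classe_description  # "Sitting Down" "Walking" "Sitting (Still)"
```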
Before we start choosing features for all models, let’s begin with graphical exploration. In the next sections we will develop some tools for this purpose.
These are some custom plotting functions I use in my notebooks. I find it useful to extract plotting code into functions for re-usability’s sake:
custom_density_plot <- function(data, x_column, color_column) {
ggplot(data, aes_string(x = x_column, color = color_column)) +
geom_density(na.rm = TRUE)
}
custom_violin_plot <- function(data, x_column, color_column) {
ggplot(data, aes_string(color_column, x_column)) +
geom_violin(aes_string(fill = color_column)) +
geom_point(position = position_jitter(width = 0.2), color = "blue", alpha = 0.2)
}
custom_scatter_plot <- function(data, x_column, y_column, color_column) {
ggplot(data, aes_string(x = x_column, y = y_column, color = color_column)) +
geom_point()
}
custom_count_barchart <- function(data, column, title) {
ggplot(data, aes_string(x=column)) +
geom_bar(fill = "blue", alpha = 0.2) +
geom_text(stat='count', aes(label=sprintf("%d%% (%d)", round(..count.. * 100/sum(..count..), 0), ..count..), vjust=0)) +
xlab(paste(column, "(", title, ")"))
}
correlation_matrix <- function(data) {
numeric_data <- data[, sapply(data,is.numeric)]
matrix <- round(cor(numeric_data), 2)
matrix[upper.tri(matrix)] <- NA
matrix <- melt(matrix, na.rm = TRUE)
return(matrix)
}
custom_correlation_heat_map <- function(data) {
matrix <- correlation_matrix(data)
ggplot(data = matrix, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), name="Pearson Correlation") +
theme(axis.text.x = element_text(angle = 90)) +
coord_fixed()
}
Their names should be descriptive enough for you to deduce what they do. Let’s take a look at the scatter plots first.
I believe it makes more sense to break the data down per subject (user_name) rather than mix data from different human subjects:
jeremy_training <- training[training$user_name == "jeremy", ]
charles_training <- training[training$user_name == "charles", ]
jeremy_scatter <- custom_scatter_plot(training[training$user_name == "jeremy", ], "raw_timestamp", "total_accel_belt", "classe_description")
charles_scatter <- custom_scatter_plot(training[training$user_name == "charles", ], "raw_timestamp", "total_accel_belt", "classe_description")
grid.arrange(jeremy_scatter, charles_scatter, ncol = 2)
I had a somewhat naive expectation that I would be able to see some remnant of a continuous curve (these are time series, with raw_timestamp on the x axis), but the sensors don’t seem that refined and essentially provide discrete measurements.
From the scatter plots alone, you get the impression that total_accel_belt would not be a great predictor, given that its values seem to almost fully overlap across all categories of classe (we can’t see patches of different colors, each representing a different classe).
You will see that this is not the case: the variable is a fairly good predictor (although not as good as the remaining ones). Scatter plots are simply not the best tool for finding patterns in this particular data. Despite that, I’m going to plot another variable (total_accel_forearm, which seems to present the least amount of overlap) just to show a better case:
jeremy_scatter <- custom_scatter_plot(jeremy_training, "raw_timestamp", "total_accel_forearm", "classe_description")
charles_scatter <- custom_scatter_plot(charles_training, "raw_timestamp", "total_accel_forearm", "classe_description")
grid.arrange(jeremy_scatter, charles_scatter, ncol = 2)
Given that these measurements are not as smooth as I was expecting and we can’t easily identify any patterns, density plots might be more suitable for the job. I’m also going to adopt “jeremy” as the user for plot analysis from this point on:
density_total_accel_forearm <- custom_density_plot(jeremy_training, "total_accel_forearm", "classe_description")
violin_total_accel_forearm <- custom_violin_plot(jeremy_training, "total_accel_forearm", "classe_description")
grid.arrange(density_total_accel_forearm, violin_total_accel_forearm, ncol = 2)
You should be familiar with density plots, but a good alternative is the violin plot, which can be seen as an “augmented box plot”. As in box plots, the region outside the violin represents the outliers, while frequent values are kept inside. In a violin plot, though, the width varies according to how frequent (dense) a value is, whereas in a box plot the width is constant (the box is rectangular).
I feel like violin plots are better visual tools than density ones, so I’m going to use them from now on:
violin_total_accel_arm <- custom_violin_plot(jeremy_training, "total_accel_arm", "classe_description")
violin_total_accel_belt <- custom_violin_plot(jeremy_training, "total_accel_belt", "classe_description")
violin_total_accel_dumbbell <- custom_violin_plot(jeremy_training, "total_accel_dumbbell", "classe_description")
grid.arrange(violin_total_accel_forearm, violin_total_accel_arm, violin_total_accel_belt, violin_total_accel_dumbbell, ncol = 2)
I have only plotted four variables, one per violin plot. You might have noticed that the violin shapes change quite a bit as we shift from one classe to another within a plot. We are going to take advantage of these differences in data density patterns to predict classe for the testing data-set.
I’m going to use the correlation heat map we have introduced in the plotting utilities section:
custom_correlation_heat_map(training)
I must remind you that correlation matrices only work with numeric variables.
Notice that some high correlations are obvious, like total_accel_dumbbell with accel_dumbbell_x, accel_dumbbell_y and accel_dumbbell_z (its components).
You will also notice that the motion of the dumbbell is highly correlated with the motion of the forearm, which also makes sense.
In general, most measurements seem uncorrelated (the heat map is quite sparse), which is a good thing: it means that most variables are not superfluous. Therefore, I don’t see a reason to do principal component analysis; I don’t think we would get much benefit from it.
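To make “the heat map is quite sparse” concrete: a quick way to check for redundancy is to list variable pairs whose absolute correlation exceeds some threshold. The following is a self-contained base-R sketch on toy data (the variable names `a`, `b`, `c` and the helper `high_cor_pairs` are illustrative, not part of this notebook):

```r
# Toy illustration (not the notebook's data): flag variable pairs whose
# absolute correlation exceeds a threshold, as a quick redundancy check.
set.seed(1)
n <- 100
x <- rnorm(n)
df <- data.frame(a = x,
                 b = x + rnorm(n, sd = 0.1),  # nearly a copy of a
                 c = rnorm(n))                # independent noise

high_cor_pairs <- function(data, threshold = 0.8) {
  cors <- cor(data)
  cors[upper.tri(cors, diag = TRUE)] <- NA    # keep each pair only once
  idx <- which(abs(cors) > threshold, arr.ind = TRUE)
  data.frame(var1 = rownames(cors)[idx[, 1]],
             var2 = colnames(cors)[idx[, 2]],
             correlation = cors[idx])
}

flagged <- high_cor_pairs(df)
flagged  # a single row: b vs a
```

On our data, a listing like this would surface exactly the obvious pairs mentioned above (total components vs. their aggregate); a mostly empty result confirms what the heat map shows visually.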
Some of the fields just won’t make good predictors, like the timestamps (raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp) and the window variables (num_window and new_window): they obviously don’t correlate with classe (they are timestamps, indexes and control variables), so we are going to remove them:
not_correlated_columns <- c("raw_timestamp_part_1", "raw_timestamp_part_2", "raw_timestamp", "cvtd_timestamp", "num_window", "new_window", "classe_description", "user_name")
training <- remove_columns(training, not_correlated_columns)
testing <- remove_columns(testing, not_correlated_columns)
all <- rbind.fill(training, testing)
I’m removing user_name as well. We don’t want this variable: The plan is to identify motion patterns for any individual, not motion patterns for particular individuals.
classe_description was just useful for analysis, so we will drop it as well.
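remove_columns is a helper defined earlier in the notebook; as a sketch (assuming it behaves like a plain name-based subset), dropping columns in base R reduces to:

```r
# Hypothetical stand-in for the notebook's remove_columns helper:
# keep every column whose name is NOT in the removal list.
remove_columns_sketch <- function(data, columns) {
  return(data[, !(names(data) %in% columns), drop = FALSE])
}

df <- data.frame(x = 1:3,
                 user_name = c("a", "b", "c"),
                 classe = c("A", "B", "E"))
trimmed <- remove_columns_sketch(df, c("user_name", "cvtd_timestamp"))
names(trimmed)  # "x" "classe"
```

Note that names absent from the data (like "cvtd_timestamp" here) are simply ignored, which makes the helper safe to apply to both the training and testing data-sets.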
A last step before we begin classification is converting classe to factor:
all$classe <- as.factor(all$classe)
training <- extract_data_by_record_id(all, training)
testing <- extract_data_by_record_id(all, testing)
That’s the last transformation required, so we can remove record_id:
training <- remove_columns(training, c("record_id"))
testing <- remove_columns(testing, c("record_id"))
In this section I’m going to try a few different classifiers on the available data and compare how they perform, using accuracy as the metric. You can extract this metric from the confusion matrix, so I have created the following helper for that purpose:
accuracy <- function(confusion) return(confusion$overall["Accuracy"])
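The confusionMatrix object comes from the caret package; for intuition, the accuracy it reports is simply the share of predictions that match the reference labels, which you can compute by hand in base R:

```r
# Base-R equivalent of the accuracy metric: the fraction of
# predictions that match the reference labels.
manual_accuracy <- function(predicted, reference) {
  return(mean(predicted == reference))
}

pred <- c("A", "B", "B", "E", "A")
ref  <- c("A", "B", "C", "E", "A")
manual_accuracy(pred, ref)  # 4 matches out of 5 -> 0.8
</imports>
```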
One thing you will notice about machine learning packages is that they don’t quite follow a common interface. For instance, when getting a prediction with the “randomForest” package, type should be set to “class”, while for the “party” package, type should be set to “response”. I want to avoid copy-and-paste programming as much as possible, so I want model-evaluation functions that work across multiple packages. For this reason, I have defined a common interface for all models:
| function | description |
|---|---|
| build(formula, data) | Builds a model from a formula and training data. |
| predictions(model, data) | Generates predictions for the input data using the model. |
R doesn’t quite support OOP, but you can emulate some OOP principles like encapsulation, so I have encapsulated build() and predictions() inside an object. I have also included the attribute name, so we can easily identify the model.
In short, what you need to do is create a function that follows this template:
my_model_builder <- function() {
name <- "<name of my model builder>"
build <- function(formula, data) "<my model builder code>"
predictions <- function(model, data) "<my predictions function>"
return(list(name=name, build=build, predictions=predictions))
}
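To see how such a closure-based “object” behaves, here is a toy builder around base R’s lm() — purely illustrative, not one of the models evaluated below:

```r
# Toy builder that follows the same interface, using base R's lm().
linear_model_builder <- function() {
  name <- "Linear Model (stats package)"
  build <- function(formula, data) return(lm(formula, data = data))
  predictions <- function(model, data) return(predict(model, newdata = data))
  return(list(name = name, build = build, predictions = predictions))
}

# The caller only ever touches the common interface:
builder <- linear_model_builder()
model <- builder$build(y ~ x, data.frame(x = 1:10, y = 2 * (1:10)))
p <- builder$predictions(model, data.frame(x = 11))  # close to 22
```

Any code written against name/build/predictions works unchanged whether the underlying model is lm, randomForest or svm.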
In the following chunk, I have created builders for a number of different packages:
SEED <- 345
random_forest_builder <- function() {
name <- "Random Forest (randomForest package)"
build <- function(formula, data) return(randomForest(formula, data = data))
predictions <- function(model, data) return(predict(model, newdata = data, type = "class"))
return(list(name=name, build=build, predictions=predictions))
}
svm_builder <- function() {
name <- "SVM (e1071 package)"
build <- function(formula, data) return(svm(formula, data = data, probability=TRUE))
predictions <- function(model, data) return(predict(model, newdata = data, type = "class"))
return(list(name=name, build=build, predictions=predictions))
}
decision_tree_builder <- function() {
name <- "Decision Tree (rpart package)"
build <- function(formula, data) return(rpart(formula, data = data, method="class"))
predictions <- function(model, data) return(predict(model, newdata = data, type = "class"))
return(list(name=name, build=build, predictions=predictions))
}
random_forest_k_fold_cross_validation_builder <- function() {
name <- "Random Forest with K-Fold Cross Validation (caret package)"
build <- function(formula, data) {
control <- trainControl(method = "cv", number = 5)
return(train(formula, data = data,
method = "rf", trControl = control,
metric="Accuracy", importance = TRUE))
}
predictions <- function(model, data) return(predict(model, newdata = data, type = "raw"))
return(list(name=name, build=build, predictions=predictions))
}
Now we can implement a general function that can evaluate any model from any package as long as we build a builder object that follows this common interface:
wrap_string <- function(string) {
return(paste(strwrap(string, 50), collapse="\n"))
}
formula_name <- function(formula) {
name <- paste(format(formula), collapse = "")
name <- wrap_string(name)
return(name)
}
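For example (repeating the two helpers above so this chunk is self-contained), a short formula passes through unwrapped, while longer ones get folded at 50 characters:

```r
# Deparse a formula into a (possibly line-wrapped) display string,
# as used in the "formula" column of the results tables below.
wrap_string <- function(string) {
  return(paste(strwrap(string, 50), collapse = "\n"))
}
formula_name <- function(formula) {
  name <- paste(format(formula), collapse = "")
  name <- wrap_string(name)
  return(name)
}

formula_name(classe ~ roll_belt + pitch_belt)
# "classe ~ roll_belt + pitch_belt"
```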
empty_evaluation <- function() {
empty <- data.frame(model = character(),
formula = character(),
accuracy = numeric(0),
error_rate = numeric(0),
elapsed_time_secs = numeric(0),
stringsAsFactors=FALSE)
return(empty)
}
evaluate_model <- function(formula, model_builder, prediction_column, train_data, cross_validation_data) {
tic()
builder <- model_builder()
model <- builder$build(formula, train_data)
predictions <- builder$predictions(model, cross_validation_data)
confusion <- confusionMatrix(data = predictions, reference = cross_validation_data[, prediction_column])
garbage <- capture.output(exec_time <- toc())
evaluation <- data.frame(model = builder$name,
formula = formula_name(formula),
accuracy = accuracy(confusion),
error_rate = 1 - accuracy(confusion),
elapsed_time_secs = (exec_time$toc - exec_time$tic),
stringsAsFactors=FALSE)
return(evaluation)
}
evaluate_models <- function(formulas, model_builders, prediction_column, train_data, cross_validation_data) {
evaluations <- empty_evaluation()
for (model_builder in model_builders) {
for (formula in formulas) {
set.seed(SEED)
evaluation <- evaluate_model(formula, model_builder, prediction_column, train_data, cross_validation_data)
evaluations <- bind_rows(evaluations, evaluation)
}
}
render_table(evaluations, digits = 6)
}
Armed with this evaluation function, I can easily evaluate new models without having to change much code. All I have to do is create a builder object and add it to the list of models passed to evaluate_models(); the evaluation function itself doesn’t need to be modified in any way.
In case you are not familiar with it, this code design practice is called the open/closed principle.
Let’s try to evaluate a couple of models now. We need X to identify each record when generating the output, but X has no value as a predictor and therefore it has been excluded from the formula:
set.seed(SEED)
train_idx <- createDataPartition(training$classe, p = 0.6, list = FALSE, times = 1)
train_data <- training[train_idx, ]
cross_validation_data <- training[-train_idx, ]
formulas <- c(classe ~ . - X)
models <- c(random_forest_builder, svm_builder, decision_tree_builder, random_forest_k_fold_cross_validation_builder)
evaluate_models(formulas, models, "classe", train_data, cross_validation_data)
| | model | formula | accuracy | error_rate | elapsed_time_secs |
|---|---|---|---|---|---|
| 1 | Random Forest (randomForest package) | classe ~ . - X | 0.995284 | 0.004716 | 78.904000 |
| 2 | SVM (e1071 package) | classe ~ . - X | 0.939969 | 0.060031 | 174.169000 |
| 3 | Decision Tree (rpart package) | classe ~ . - X | 0.725593 | 0.274407 | 3.902000 |
| 4 | Random Forest with K-Fold Cross Validation (caret package) | classe ~ . - X | 0.992735 | 0.007265 | 2006.722000 |
Notice that I’m breaking the original training data-set into a new training set (train_data, let’s call it the “sub-training” set) and a cross-validation set. The error rate shown in the output is a good estimate of the out-of-sample error on the testing data-set, given that it is calculated over cross-validation data, which doesn’t share any records with the sub-training set.
You might notice that a single evaluation function can evaluate any model from any package (as long as I create a model builder object that follows the pre-defined interface). Also note that I’m breaking the training data into train_data and cross_validation_data because the provided testing data-set isn’t labeled.
From the evaluation results, random forest (from the “randomForest” package) performed best. I have also tried random forest with k-fold cross-validation, which took over half an hour to complete (with k equal to 5) and performed worse than the first. I guess it could be improved by increasing k, but that would make building the classifier even slower. It seems that the defaults of the “randomForest” trainer — ntree (500) and mtry (the square root of the number of predictors, for classification) — do a good enough job.
This experiment makes use of a range of different sensor measurements (roll, pitch, yaw, acceleration, etc.) taken at different locations: on the human subject’s body (arm, forearm, belt) and on the dumbbell itself.
I was wondering how much accuracy I could get from these sensors separately (by body part, or by type of measurement). That’s an interesting question: do we really need that many sensors?
The random forest (package randomForest) seems quite fast, so I’m going to repeat the evaluation for different combinations of features.
belt_formula <- classe~roll_belt+pitch_belt+yaw_belt+total_accel_belt+
gyros_belt_x+gyros_belt_y+gyros_belt_z+
accel_belt_x+accel_belt_y+accel_belt_z+
magnet_belt_x+magnet_belt_y+magnet_belt_z
arm_formula <- classe~roll_arm+pitch_arm+yaw_arm+total_accel_arm+
gyros_arm_x+gyros_arm_y+gyros_arm_z+
accel_arm_x+accel_arm_y+accel_arm_z+
magnet_arm_x+magnet_arm_y+magnet_arm_z
forearm_formula <- classe~roll_forearm+pitch_forearm+yaw_forearm+total_accel_forearm+
gyros_forearm_x+gyros_forearm_y+gyros_forearm_z+
accel_forearm_x+accel_forearm_y+accel_forearm_z+
magnet_forearm_x+magnet_forearm_y+magnet_forearm_z
dumbbell_formula <- classe~roll_dumbbell+pitch_dumbbell+yaw_dumbbell+total_accel_dumbbell+
gyros_dumbbell_x+gyros_dumbbell_y+gyros_dumbbell_z+
accel_dumbbell_x+accel_dumbbell_y+accel_dumbbell_z+
magnet_dumbbell_x+magnet_dumbbell_y+magnet_dumbbell_z
formulas <- c(belt_formula, arm_formula, forearm_formula, dumbbell_formula)
models <- c(random_forest_builder)
evaluate_models(formulas, models, "classe", train_data, cross_validation_data)
| | model | formula | accuracy | error_rate | elapsed_time_secs |
|---|---|---|---|---|---|
| 1 | Random Forest (randomForest package) | classe ~ roll_belt + pitch_belt + yaw_belt + total_accel_belt + gyros_belt_x + gyros_belt_y + gyros_belt_z + accel_belt_x + accel_belt_y + accel_belt_z + magnet_belt_x + magnet_belt_y + magnet_belt_z | 0.905430 | 0.094570 | 23.291000 |
| 2 | Random Forest (randomForest package) | classe ~ roll_arm + pitch_arm + yaw_arm + total_accel_arm + gyros_arm_x + gyros_arm_y + gyros_arm_z + accel_arm_x + accel_arm_y + accel_arm_z + magnet_arm_x + magnet_arm_y + magnet_arm_z | 0.861840 | 0.138160 | 23.158000 |
| 3 | Random Forest (randomForest package) | classe ~ roll_forearm + pitch_forearm + yaw_forearm + total_accel_forearm + gyros_forearm_x + gyros_forearm_y + gyros_forearm_z + accel_forearm_x + accel_forearm_y + accel_forearm_z + magnet_forearm_x + magnet_forearm_y + magnet_forearm_z | 0.895106 | 0.104894 | 23.593000 |
| 4 | Random Forest (randomForest package) | classe ~ roll_dumbbell + pitch_dumbbell + yaw_dumbbell + total_accel_dumbbell + gyros_dumbbell_x + gyros_dumbbell_y + gyros_dumbbell_z + accel_dumbbell_x + accel_dumbbell_y + accel_dumbbell_z + magnet_dumbbell_x + magnet_dumbbell_y + magnet_dumbbell_z | 0.885419 | 0.114581 | 23.252000 |
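The long per-sensor formulas above could equivalently be generated from column-name patterns; the following is a hedged sketch (the `columns` vector stands in for names(training), and `sensor_formula` is a hypothetical helper, not one used elsewhere in this notebook):

```r
# Sketch: build a per-sensor formula from column-name patterns
# instead of typing out every term by hand.
columns <- c("roll_belt", "pitch_belt", "yaw_belt", "total_accel_belt",
             "gyros_belt_x", "roll_arm", "accel_arm_x", "classe")

sensor_formula <- function(columns, sensor) {
  # match names ending in "_<sensor>" or containing "_<sensor>_"
  predictors <- grep(paste0("_", sensor, "(_|$)"), columns, value = TRUE)
  return(reformulate(predictors, response = "classe"))
}

sensor_formula(columns, "belt")
# classe ~ roll_belt + pitch_belt + yaw_belt + total_accel_belt + gyros_belt_x
```

Writing the formulas out explicitly, as done above, has the advantage of making the exact feature sets visible in the evaluation tables.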
roll_pitch_yaw_formula <- classe~roll_belt+pitch_belt+yaw_belt+
roll_arm+pitch_arm+yaw_arm+
roll_forearm+pitch_forearm+yaw_forearm+
roll_dumbbell+pitch_dumbbell+yaw_dumbbell
total_accel_formula <- classe~total_accel_belt+total_accel_arm+total_accel_forearm+total_accel_dumbbell
gyros_formula <- classe~gyros_belt_x+gyros_belt_y+gyros_belt_z+
gyros_arm_x+gyros_arm_y+gyros_arm_z+
gyros_forearm_x+gyros_forearm_y+gyros_forearm_z+
gyros_dumbbell_x+gyros_dumbbell_y+gyros_dumbbell_z
magnet_formula <- classe~magnet_belt_x+magnet_belt_y+magnet_belt_z+
magnet_arm_x+magnet_arm_y+magnet_arm_z+
magnet_forearm_x+magnet_forearm_y+magnet_forearm_z+
magnet_dumbbell_x+magnet_dumbbell_y+magnet_dumbbell_z
accel_formula <- classe~accel_belt_x+accel_belt_y+accel_belt_z+
accel_arm_x+accel_arm_y+accel_arm_z+
accel_forearm_x+accel_forearm_y+accel_forearm_z+
accel_dumbbell_x+accel_dumbbell_y+accel_dumbbell_z
formulas <- c(roll_pitch_yaw_formula, total_accel_formula, gyros_formula, magnet_formula, accel_formula)
evaluate_models(formulas, models, "classe", train_data, cross_validation_data)
| | model | formula | accuracy | error_rate | elapsed_time_secs |
|---|---|---|---|---|---|
| 1 | Random Forest (randomForest package) | classe ~ roll_belt + pitch_belt + yaw_belt + roll_arm + pitch_arm + yaw_arm + roll_forearm + pitch_forearm + yaw_forearm + roll_dumbbell + pitch_dumbbell + yaw_dumbbell | 0.985725 | 0.014275 | 19.548000 |
| 2 | Random Forest (randomForest package) | classe ~ total_accel_belt + total_accel_arm + total_accel_forearm + total_accel_dumbbell | 0.675631 | 0.324369 | 13.123000 |
| 3 | Random Forest (randomForest package) | classe ~ gyros_belt_x + gyros_belt_y + gyros_belt_z + gyros_arm_x + gyros_arm_y + gyros_arm_z + gyros_forearm_x + gyros_forearm_y + gyros_forearm_z + gyros_dumbbell_x + gyros_dumbbell_y + gyros_dumbbell_z | 0.881723 | 0.118277 | 27.022000 |
| 4 | Random Forest (randomForest package) | classe ~ magnet_belt_x + magnet_belt_y + magnet_belt_z + magnet_arm_x + magnet_arm_y + magnet_arm_z + magnet_forearm_x + magnet_forearm_y + magnet_forearm_z + magnet_dumbbell_x + magnet_dumbbell_y + magnet_dumbbell_z | 0.955774 | 0.044226 | 21.167000 |
| 5 | Random Forest (randomForest package) | classe ~ accel_belt_x + accel_belt_y + accel_belt_z + accel_arm_x + accel_arm_y + accel_arm_z + accel_forearm_x + accel_forearm_y + accel_forearm_z + accel_dumbbell_x + accel_dumbbell_y + accel_dumbbell_z | 0.938440 | 0.061560 | 22.220000 |
belt_forearm_formula <- classe~roll_belt+pitch_belt+yaw_belt+total_accel_belt+
gyros_belt_x+gyros_belt_y+gyros_belt_z+
accel_belt_x+accel_belt_y+accel_belt_z+
magnet_belt_x+magnet_belt_y+magnet_belt_z+
roll_forearm+pitch_forearm+yaw_forearm+total_accel_forearm+
gyros_forearm_x+gyros_forearm_y+gyros_forearm_z+
accel_forearm_x+accel_forearm_y+accel_forearm_z+
magnet_forearm_x+magnet_forearm_y+magnet_forearm_z
roll_pitch_yaw_magnet_formula <- classe~roll_belt+pitch_belt+yaw_belt+
roll_arm+pitch_arm+yaw_arm+
roll_forearm+pitch_forearm+yaw_forearm+
roll_dumbbell+pitch_dumbbell+yaw_dumbbell+
magnet_belt_x+magnet_belt_y+magnet_belt_z+
magnet_arm_x+magnet_arm_y+magnet_arm_z+
magnet_forearm_x+magnet_forearm_y+magnet_forearm_z+
magnet_dumbbell_x+magnet_dumbbell_y+magnet_dumbbell_z
formulas <- c(belt_forearm_formula, roll_pitch_yaw_magnet_formula)
evaluate_models(formulas, models, "classe", train_data, cross_validation_data)
| | model | formula | accuracy | error_rate | elapsed_time_secs |
|---|---|---|---|---|---|
| 1 | Random Forest (randomForest package) | classe ~ roll_belt + pitch_belt + yaw_belt + total_accel_belt + gyros_belt_x + gyros_belt_y + gyros_belt_z + accel_belt_x + accel_belt_y + accel_belt_z + magnet_belt_x + magnet_belt_y + magnet_belt_z + roll_forearm + pitch_forearm + yaw_forearm + total_accel_forearm + gyros_forearm_x + gyros_forearm_y + gyros_forearm_z + accel_forearm_x + accel_forearm_y + accel_forearm_z + magnet_forearm_x + magnet_forearm_y + magnet_forearm_z | 0.987764 | 0.012236 | 40.589000 |
| 2 | Random Forest (randomForest package) | classe ~ roll_belt + pitch_belt + yaw_belt + roll_arm + pitch_arm + yaw_arm + roll_forearm + pitch_forearm + yaw_forearm + roll_dumbbell + pitch_dumbbell + yaw_dumbbell + magnet_belt_x + magnet_belt_y + magnet_belt_z + magnet_arm_x + magnet_arm_y + magnet_arm_z + magnet_forearm_x + magnet_forearm_y + magnet_forearm_z + magnet_dumbbell_x + magnet_dumbbell_y + magnet_dumbbell_z | 0.993500 | 0.006500 | 34.357000 |
Not surprisingly, total_accel_formula didn’t fare well: taking the vector magnitude removes spatial information from the acceleration, which is why accel_formula did much better.
roll_pitch_yaw_magnet_formula (combining roll, pitch, yaw and magnetic measurements) seems to predict classe almost as well as the full formula, with an accuracy of nearly 0.99.
belt_forearm_formula (combining belt and forearm measurements) also has accuracy close to 0.99.
If we were to put an actual bio-metric fitness product in the market (and I guess that was one of the motivations of the research), I believe that we would try to simplify our design as much as possible.
There’s also the matter of convenience: I’m guessing that most people are probably not willing to wear this many sensors.
builder <- random_forest_builder()
model <- builder$build(classe ~ . - X, training)
varImpPlot(model, type=2)
You may have noticed that the best predictors are roll, pitch, yaw and magnetic measurements, as we have pointed out in the previous section when evaluating models with different combinations of features.
Let’s generate an output file with our predictions now:
remove_non_alphanumeric <- function(string) {
return(str_replace_all(string, "[^\\w]", ""))
}
output_name <- function(formula, builder_name) {
output <- paste0(builder_name, formula_name(formula))
output <- remove_non_alphanumeric(output)
output <- paste0(output, ".csv")
return(output)
}
generate_predictions <- function(formula, model_builder, prediction_column, index_column, train_data, input_data) {
builder <- model_builder()
model <- builder$build(formula, train_data)
predictions <- builder$predictions(model, input_data)
input_data[, prediction_column] <- predictions
output <- input_data[, names(input_data) %in% c(index_column, prediction_column)]
write.csv(output, file = output_name(formula, builder$name), row.names = FALSE)
return(input_data)
}
output <- generate_predictions(classe ~ . - X, random_forest_builder, "classe", "X", training, testing)
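As an aside, remove_non_alphanumeric above relies on stringr’s str_replace_all; the same sanitization can be done in base R with gsub, since "\\w" and the POSIX class [[:alnum:]_] match the same characters (letters, digits and underscore):

```r
# Base-R equivalent of remove_non_alphanumeric: strip every character
# that is not a letter, digit, or underscore.
remove_non_alphanumeric_base <- function(string) {
  return(gsub("[^[:alnum:]_]", "", string))
}

remove_non_alphanumeric_base("Random Forest (randomForest package) classe ~ . - X")
# "RandomForestrandomForestpackageclasseX"
```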
Let’s compare the counts for the training and testing data-set as a sanity check:
training_count <- custom_count_barchart(training, "classe", "training")
testing_count <- custom_count_barchart(output, "classe", "testing")
grid.arrange(training_count, testing_count, ncol = 2)
I was expecting similar counts (percentages), but it’s not necessarily the case that the training and testing data-sets were randomly split from a single whole data-set. If you recall, the X values (record IDs) in the testing data-set overlap with the ones in the training data-set, suggesting that the two sets were generated by independent experiments with the same setup.
An interesting thing to notice about our data is the number of measurements taken (accelerations, angular velocities, etc.). One would think that tracking a single variable over time (position, for instance) would be enough to deduce classe.
Also, looking at the scatter plots you will notice that the data is not that smooth and is quite noisy.
In practice, measurement devices are never perfect, and the best ones can be quite expensive. You get more bang for your buck by using many inexpensive low-quality devices than a few expensive high-quality ones. Even though the quality of our measurements is not that great, we can compensate by taking more measurements (from different sources) and still do very well in the classification.